The development of a tumor involves the evolution of both transformed soon-to-be-cancerous cells, and the the stroma in the vicinity of those cells. Junttila MR, et al. May studies have shown that the stroma in the tumor microenvironment (TME) plays a critical role in the progression of cancer, as well as its impact on cancer’s acquired resistance to drug treatment.
Unlike intrinsic resistance, which is associated with genetic and epigenetic alterations to cancer cells, acquired resistance is developed (naturally) through drug treatment; and the study of such resistance driven by highly active stroma has been the subject of increasing study in recent years.
As mutations rarely occur in the stroma themselves, researchers are interested in discerning how to manage TME-mediate drug resistance, as this will allow for the development of more effective therapeutic strategies. By identifying predictive bio markers to better target both cancer cells and the TME, clinicians could make more informed decisions when prescribing treatment to their patients to minimize the chance of inducing drug resistance in the cancer, and by extension, have a substantial impact on the prognosis of their patients.Holohan C, Van Schaeybroeck S, et al.
Cancer treatments like chemotherapy and radiation not only kill cancer cells, but can inflict substantial damage on formerly benign stromatolites cells in the TME. The process often induces cellular senescence, which in turn is strongly associated with the appearance of a senescence-associated secretory phenotype (SASP)Coppe JP, et al. SASP is itself associated with wound healing (tissue repair, the recruitment of immune cells, etc.), however, recent studies have shown that downstream SASP growth factors can promote chemoresistence of cancer cells that survive the first waves of treatment.
Sun et al. discovered that SPINK1, a serine peptidase inhibitor Kazan type 1 was found to be highly expressed in human SASP stromal cells, which had previously been identified in many human cancers, and was correlated with negative patient outcome. However, the mechanism of its expression and how specifically it impacted the pathology of human cancer had yet to be thoroughly explored. But given its connection to SASP, and it’s broad presence in other human cancers, Sun et al. determined that stromal SPINK1 would be a good candidate to investigate as a potential biomarker.
Following the demonstration of substantial alterations to pancreatic cancer cell phenotype induced by pancrine SPINK1, Sun et al.then investigated the impact of of stromal SPINK1 on cancer cell expression directly. In order to do so,they performed RNA-Seq analysis on two separate prostace cancer (PCa) cell lines, PC3, and DU145, treated either with WT PSC27 cell-based control media, or SPINK1-overexpressing PSC27 cell-based media. As both PC3 and DU145 had been shown to significantly increase in proliferation when treated with CM overexpressing SPINK1, they were condifent that such analysis would shed light on how SPINK1 would impact cancer cell expression.
For my analysis, I will perform differential expression analysis of the RNA-Seq data generated in the aforementioned study. My goal will be to confirm that SPINK1 significantly affects transcriptome-wide expression of multiple PCa cell lines, and thereby can be verified as a suitable biomarker for treatment-damaged TME.
I believe it is reasonable to suggest that I’ve accomplished my modest goal of demonstrating the pronounced affect that SPINK1 has on the expression of two different PCa cell lines. Several genes were identified showing significant amounts of up and down regulation for both DU145 and PC3, and the overlap between the two is significant as well. These are promising if crude initial results, which should be followed up with more sophisticated gene ontology analysis of the results so as to better characterize the nature of the up and down regulation in an intra-cellular context, and confirm that these observed expression modifications are indeed responsible for the aforementioned SASP type stromal cells behavior, and the development of drug resistance.
As mentioned in the introduction, the data used for this report was generated by Sun et al., and utilized in their publication: Targeting SPINK1 in the damaged tumour microenvironment alleviates therapeutic resistance. The specific dataset from the paper was titled “Transcriptomic Reprogramming of Prostate Cancer Cells Driven by Stroma-Derived SPINK1”, and can be found here. The data was generated by the Cancer Resistance Lab at the Shanghai Institute of Nutrition and Health, and first published to GEO in December of 2017.
The RNA itself was obtained from PC3 and DU145 PCa cell lines. Following coculture and media removal, the cells were lysed with Trizol reagent and total RNA was harvested.
The libraries were preped using Illumina’s TruSeq Stranded mRNA LT Sample Prep Kit (RS-122-2101, RS-122-2102) was used, with 1 ug of total RNA for the construction of sequencing libraries.
After culture had reached 80% confluency, cells were treated with media derived from either control PSC27 prostate cancer cells, or SPINK1-overexpressing PSC27 prostate cancer cells for three consecutive days.
The transcript profiles of both cancer cell lines were generated by deep sequencing, in triplicate, using Illumina HiSeq X10. This dataset was chosen at the recommendation Mervin Fansler.
I downloaded the SRA datasets using the [python-fastq-downloader]{https://github.com/erilu/python-fastq-downloader} package, and the following scripts. I first prefetch the SRAs:
#!/bin/bash -l
#SBATCH --partition=angsd_class
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --job-name=gmd4002_file_download
#SBATCH --time=24:00:00
#SBATCH --mem=12G
#SBATCH --mail-user=gmd4002@med.cornell.edu
#SBATCH --mail-type=END,FAIL
cd /athena/angsd/scratch/gmd4002/Final_Project
spack load sra-toolkit@2.10.7
# Print info about this run to a log file
echo "Starting at:" `date` >> log.txt
echo "This is job #:" $SLURM_JOB_ID >> log.txt
echo "Running on node:" `hostname` >> log.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> log.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> log.txt
runs=("SRR6423010" "SRR6423011" "SRR6423012" "SRR6423013" "SRR6423014" "SRR6423015" "SRR6423016" "SRR6423017" "SRR6423018" "SRR6423019" "SRR6423020" "SRR6423021")
for i in ${runs[@]}; do
echo $i >> log.txt
prefetch $i;
done
Once downloaded in my scratch space, I then proceeded to ddump the files from the libraries using fastq-dump. Note that the libraries themselves were paired-ended.
#!/bin/bash -l
#SBATCH --partition=angsd_class
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=gmd4002_fastq_dump_files
#SBATCH --time=24:00:00
#SBATCH --mem=45G
#SBATCH --mail-user=gmd4002@med.cornell.edu
#SBATCH --mail-type=END,FAIL
cd /athena/angsd/scratch/gmd4002/Final_Project/sra_files
spack load sra-toolkit@2.10.7
# Print info about this run to a log file
echo "Starting at:" `date` >> log.txt
echo "This is job #:" $SLURM_JOB_ID >> log.txt
echo "Running on node:" `hostname` >> log.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> log.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> log.txt
runs=("SRR6423010" "SRR6423011" "SRR6423012" "SRR6423013" "SRR6423014" "SRR6423015" "SRR6423016" "SRR6423017" "SRR6423018" "SRR6423019" "SRR6423020" "SRR6423021")
for i in ${runs[@]}; do
echo $i >> log.txt
fastq-dump --gzip --skip-technical --readids --read-filter pass --dumpbase --clip --split-files $i
echo "Done.";
done
Lastly,
#! /bin/bash -l
# Batch commands
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=FASTQC_gmd4002
#SBATCH --time=24:00:00
#SBATCH --mem=35G
#SBATCH --mail-user=gmd4002@med.cornell.edu
#SBATCH --mail-type=END,FAIL
cd /athena/angsd/scratch/gmd4002/Final_Project/runs
# Print info about this run to a log file
echo "Starting at:" `date` >> log.txt
echo "This is job #:" $SLURM_JOB_ID >> log.txt
echo "Running on node:" `hostname` >> log.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> log.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> log.txt
spack load fastqc
for file in */*.gz; do fastqc $file --extract
echo "${file##*/}" >> log.txt;
done
echo " Finished at " `date ` "!" >> log.txt
To generate a STAR index for the mapping of these libraries, I first needed to download a human reference genome and its reference annotations using wget, then using them to generate our STAR index.
#!/bin/bash -l
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=gmd4002_STAR_align_hw_6
#SBATCH --time=24:00:00
#SBATCH --mem=80G
#SBATCH --mail-user=gmd4002@med.cornell.edu
#SBATCH --mail-type=END,FAIL
set -e
cd /athena/angsd/scratch/gmd4002/Final_Project
echo "Starting at:" `date` >> log.txt
echo "This is job #:" $SLURM_JOB_ID >> log.txt
echo "Running on node:" `hostname` >> log.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> log.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> log.txt
# Get the genome reference data:
mkdir -p reference
wget -c -O reference/GRCh38.primary_assembly.genome.fa.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh38.primary_assembly.genome.fa.gz
wget -c -O reference/gencode.v39.annotation.gtf.gz http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.annotation.gtf.gz
spack load star@2.7
echo "wget complete" >> /athena/angsd/scratch/gmd4002/Final_Project/log.txt
cd reference
gunzip GRCh38.primary_assembly.genome.fa.gz
gunzip gencode.v39.annotation.gtf.gz
echo "gunzip complete" >> /athena/angsd/scratch/gmd4002/Final_Project/log.txt
mkdir -p GRCh38_STARindex
STAR --runMode genomeGenerate --runThreadN 3 --genomeDir GRCh38_STARindex --genomeFastaFiles GRCh38.primary_assembly.genome.fa --sjdbGTFfile gencode.v39.annotation.gtf --sjdbOverhang 299
gzip GRCh38.primary_assembly.genome.fa
echo " Finished at " `date ` "!" >> /athena/angsd/scratch/gmd4002/Final_Project/log.txt
Note that the sjdbOverhang is set to the extremely high 299, given my initial (mistaken) impression all the way back from homework 6 that the majority of my reads from the datasets were of length 300, and ideally the parameter should be set to that value minus 1. Later checking of the fine print of my dataset showed what would have probably been obvious to someone with more experience in retrospect, that as deep as these reads were, they weren’t that deep, but rather reflections of paired-ended reads. I set the runThreadN parameter to 3 to speed up processing time, though even so, considering that this was performed on a human genome, and to such a great depth, the process took quite some time to complete.
[gmd4002@curie sra_files_backup/SRR16423013]$ less SRR6423013_pass.fastq.gz | awk '{print $3}' | head
length=300
length=300
length=300
length=300
length=300
Upon confering with Dr. Skrabanek and Mervin Fansler, it was determined that the resultant STAR index file would still be suitable for the purposes of my analysis, even if it was comically oversized.
Following the creation of the index, I then performed several rounds of utiling FastQC and MultiQC, the results of which indicated that I should apply TrimGalore to my data, as well as utilize QoRTS to be on the safe side. The code I ultimately ran will be at the bottom of this section, but I’ll reproduce the basic commands in these sections first so as to better represent the appropriate chronological order for analyisis.
I ran FastQC on my downloaded files using the following snippet:
(Leaving out the bash bits for the sake of legibility)
gz_out="/athena/angsd/scratch/gmd4002/Final_Project/fastqc"
mkdir -p $gz_out
cd /athena/angsd/scratch/gmd4002/Final_Project/sra_files_backup
spack load fastqc
fastqc -o $gz_out SRR*/*.fastq.gz
The output of that former command (as well as my QoRTS runs) are then ingested using multiqc
spack load -r py-multiqc
multiqc --title "New QC Results" SRR6423010/ SRR6423010/*.QC/ SRR6423011/ SRR6423011/*.QC/ SRR6423012/ SRR6423012/*.QC/ SRR6423013/ SRR6423013/*.QC/ SRR6423014/ SRR6423014/*.QC/ SRR6423015/ SRR6423015/*.QC/ SRR6423016/ SRR6423016/*.QC/ SRR6423017/ SRR6423017/*.QC/ SRR6423018/ SRR6423018/*.QC/ SRR6423019/ SRR6423019/*.QC/ SRR6423020/ SRR6423020/*.QC/ SRR6423021/ SRR6423021/*.QC/
As mentioned before, my initial FastQC and MultiQC results seemed to indicate that I had a high amount of repeated sequences in my results. To compensate for this problem, I used TrimGalore, which leverages Cutadapt and FastQC to deal with the presence of recognized adapters. The command I used can be seen below:
SRAS=( "SRR6423010" "SRR6423011" "SRR6423012" "SRR6423013" "SRR6423014" "SRR6423015" "SRR6423016" "SRR6423017" "SRR6423018" "SRR6423019" "SRR6423020" "SRR6423021" )
#SRAS=("SRR6423010")
for treatment in "${SRAS[@]}"; do
cd "$treatment"
echo "We are in " $treatment >> ${outrep}/april_20.txt
trim_galore --illumina --fastqc --paired $treatment"_pass_1.fastq.gz" $treatment"_pass_2.fastq.gz"
cd ..
done
Note that here, given that we know that the libraries were generated using Ilumina kits, that we can pass the –illumina and –paired tags to trim galore.
QoRTSWhile QoRTS was potentially unnecessary, I was re-running my MultiQC anyway, and wanted to see if I could get it to work correclty given my prior difficulties with it in a previous homework. Discussion of the implications of the results will be covered in a later section, but the snippit used can be seen below:
q_dir = `spack location -i qorts`
for treatment in "${SRAS[@]}"; do
cd "$treatment"
# QoRTs QC
java -Xmx16G -jar ${q_dir}/bin/QoRTs.jar QC --stranded --coordSorted --title "$treatment" "$treatment.Aligned.sortedByCoord.out.bam" ${refrep}/gencode.v39.annotation.gtf ./$treatment.QC
cd ..
done
Once satisfied with my data preprocessing, I could then run an alignment of the raw Fastq data to my previously generated index file. The process of generating this index can be seen above, and the snippet of code can be seen below:
for treatment in "${SRAS[@]}"; do
cd "$treatment"
STAR --runMode alignReads \
--runThreadN 16 \
--genomeDir /athena/angsd/scratch/gmd4002/Final_Project/reference/GRCh38_STARindex \
--readFilesIn $treatment"_pass_1_val_1.fq.gz" $treatment"_pass_2_val_2.fq.gz" \
--readFilesCommand zcat \
--outFileNamePrefix "$treatment." \
--outSAMtype BAM SortedByCoordinate
samtools index "$treatment.Aligned.sortedByCoord.out.bam"
echo "STAR for " $treatment "completed at " `date` >> ${outrep}/april_20.txt
cd ..
done
While the alignment step of this process is one of the more computationally expensive ones to perform, setting the runThreadN value to 16 might have been slight overkill.
featureCountsFollowing all these pre-processing steps, and running the STAR alignment, I then ran featureCounts, a useful tool for counting reads now that they’ve been properly mapped to relevant genomic features. The code can be seen below:
#! /bin/bash -l
# Batch commands
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --job-name=gmd4002_feature_counts_project_merv_3
#SBATCH --time=24:00:00
#SBATCH --mem=15G
#SBATCH --mail-user=gmd4002@med.cornell.edu
#SBATCH --mail-type=END,FAIL
spack load subread
Processed="/athena/angsd/scratch/gmd4002/Final_Project/sra_files_backup"
cd $Processed
echo "Starting at:" `date` >> april_20.txt
echo "This is job #:" $SLURM_JOB_ID >> april_20.txt
echo "Running on node:" `hostname` >> april_20.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> april_20.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> april_20.txt
touch ${TMPDIR}/test_file_$SLURM_JOB_ID
gDir="/athena/angsd/scratch/gmd4002/Final_Project/reference"
bams=$(ls SRR64230*/*.bam)
featureCounts -p -s 2 --extraAttributes gene_name,gene_biotype -a $gDir/gencode.v39.annotation.gtf -T 6 -o feature_counts_s2.txt ${bams[@]}
echo " Finished at " `date ` "!" >> april_20.txt
The extraAttributes command was utilized in order to capture those features, but ultimately additional feature enrichment was required downstream that due to time (and last minute computational) constraints I performed manually.
The most recent itteration of the consolidated code can be seen below. The operations are, again, out of order, and ideally should still be run separately so as to efficiently allocate shared computational resources. Do as I say, not as I do.
#! /bin/bash -l
# Batch commands
#SBATCH --partition=angsd_class
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=gmd4002_trim_test
#SBATCH --time=24:00:00
#SBATCH --mem=90G
#SBATCH --mail-user=gmd4002@med.cornell.edu
#SBATCH --mail-type=END,FAIL
spack load star@2.7.0e
spack load samtools@1.9%gcc@6.3.0
spack load -r trimgalore
spack load fastqc
spack load qorts@1.2.42
spack load -r py-multiqc
outrep=/athena/angsd/scratch/gmd4002/Final_Project/sra_files_backup
cd $outrep
echo "Starting at:" `date` >> ${outrep}/april_20.txt
echo "This is job #:" $SLURM_JOB_ID >> ${outrep}/april_20.txt
echo "Running on node:" `hostname` >> april_20.txt
echo "Running on cluster:" $SLURM_CLUSTER_NAME >> ${outrep}/april_20.txt
echo "This job was assigned the temporary (local) directory:" $TMPDIR >> ${outrep}/april_20.txt
touch ${TMPDIR}/test_file_$SLURM_JOB_ID
SRAS=( "SRR6423010" "SRR6423011" "SRR6423012" "SRR6423013" "SRR6423014" "SRR6423015" "SRR6423016" "SRR6423017" "SRR6423018" "SRR6423019" "SRR6423020" "SRR6423021" )
#SRAS=("SRR6423010")
for treatment in "${SRAS[@]}"; do
cd "$treatment"
echo "We are in " $treatment >> ${outrep}/april_20.txt
trim_galore --illumina --fastqc --paired $treatment"_pass_1.fastq.gz" $treatment"_pass_2.fastq.gz"
cd ..
done
echo "Starting STAR" $treatment >> ${outrep}/april_20.txt
for treatment in "${SRAS[@]}"; do
cd "$treatment"
STAR --runMode alignReads \
--runThreadN 16 \
--genomeDir /athena/angsd/scratch/gmd4002/Final_Project/reference/GRCh38_STARindex \
--readFilesIn $treatment"_pass_1_val_1.fq.gz" $treatment"_pass_2_val_2.fq.gz" \
--readFilesCommand zcat \
--outFileNamePrefix "$treatment." \
--outSAMtype BAM SortedByCoordinate
samtools index "$treatment.Aligned.sortedByCoord.out.bam"
echo "STAR for " $treatment "completed at " `date` >> ${outrep}/april_20.txt
cd ..
done
echo " Finished at " `date ` "!" >> ${outrep}/april_20.txt
multiqc --title "QC Results" SRR6423010/ SRR6423011/ SRR6423012/ SRR6423013/ SRR6423014/ SRR6423015/ SRR6423016/ SRR6423017/ SRR6423018/ SRR6423019/ SRR6423020/ SRR6423021/
The output of that multiqc can be seen here
FastQC Raw Data QCInitial examinations of my non-trimmed FastQC appeared normal, though with a couple annomalies.The first of these multiQC summaries of the FastQC data can be seen here
As expected, all samples failed to pass the Adapter Content test test. Additionally, there appeared to be minor contamination in samples SRR6423017 (WT_PC3_2) and SRR6423021 (SPINK1_PC3_3), which at the time I assumed to be an overloading of adapters in those two samples (experimental error.)
Once I had applied trimming to the samples, all all samples then passed the Adapter Content test, so after conferring with Merv, I decided to move on to further downstream analysis.
However, upon closer inspection during my writeup, I discovered some anomalies that might have been dealt with earlier had I checked my 24 files more carefully.
Of note was the stubborn remaining presence of over-represented sequences in SRR6423017 (WT_PC3_2) and SRR6423021 (SPINK1_PC3_3), which were seemingly uneffected by the filtering process:
Given that my reads were so deep, I had somehow rationalized that the removal of the warnings shown in my new overrepresented sequences graphs for these two samples indicated that they were merely signs of normal over-expression.
All of the samples had ~50% duplicates in all samples, but given the depth of the reads, and the fact that the warnings present in the earlier graph were removed, Merv and I seemed to feel that this was potentially a benign issue.
FastQC Reports
Checking the FastQC files directly however made it plain that these were due to specific and unfiltered TruSeq Adapter and Illumina Single End PCR Primer present in the trimmed samples.
These issues were also detectable through checking GC content,
and other metrics a more seasoned analysis might have reacted to earlier. The consiquences of not responding to this issue earier will have some downstream consiquences unfortunately, but I will do my best to address them as I proceed forward in my analysis.
Required Libraries:
Our various packages necessary for running subsiquent code.
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:S4Vectors':
##
## expand
## The following object is masked from 'package:magrittr':
##
## extract
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ readr 2.1.2 ✓ stringr 1.4.0
## ✓ purrr 0.3.4 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::collapse() masks IRanges::collapse()
## x dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## x dplyr::count() masks matrixStats::count()
## x dplyr::desc() masks IRanges::desc()
## x tidyr::expand() masks S4Vectors::expand()
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks S4Vectors::first()
## x dplyr::lag() masks stats::lag()
## x BiocGenerics::Position() masks ggplot2::Position(), base::Position()
## x purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
## x dplyr::rename() masks S4Vectors::rename()
## x purrr::set_names() masks magrittr::set_names()
## x dplyr::slice() masks IRanges::slice()
##
## Attaching package: 'plyranges'
## The following objects are masked from 'package:dplyr':
##
## between, n, n_distinct
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:stats':
##
## filter
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
We start by reading in our feature counts data, starting with the summary:
Our column names are a little gross, so we should fix up their names and mutate our table a bit.
readcounts_summary <- readcounts_summary[c(which(readcounts_summary$ReadCounts!=0)),]
readcounts_summary %>%
ggplot(aes(sample, ReadCounts)) + geom_bar(stat="identity", aes(fill=Assignment), position="dodge") + coord_flip()
This can be seen as a recapitulation of our previous QC steps, but it’s a good sanity check none the less.
featureCounts DataNow let’s look at our actual data.
readcounts <- paste0(folder, "feature_counts_s2.txt") %>%
read.table(., header=TRUE)
names(readcounts)
## [1] "Geneid"
## [2] "Chr"
## [3] "Start"
## [4] "End"
## [5] "Strand"
## [6] "Length"
## [7] "gene_name"
## [8] "gene_biotype"
## [9] "SRR6423010.SRR6423010.Aligned.sortedByCoord.out.bam"
## [10] "SRR6423011.SRR6423011.Aligned.sortedByCoord.out.bam"
## [11] "SRR6423012.SRR6423012.Aligned.sortedByCoord.out.bam"
## [12] "SRR6423013.SRR6423013.Aligned.sortedByCoord.out.bam"
## [13] "SRR6423014.SRR6423014.Aligned.sortedByCoord.out.bam"
## [14] "SRR6423015.SRR6423015.Aligned.sortedByCoord.out.bam"
## [15] "SRR6423016.SRR6423016.Aligned.sortedByCoord.out.bam"
## [16] "SRR6423017.SRR6423017.Aligned.sortedByCoord.out.bam"
## [17] "SRR6423018.SRR6423018.Aligned.sortedByCoord.out.bam"
## [18] "SRR6423019.SRR6423019.Aligned.sortedByCoord.out.bam"
## [19] "SRR6423020.SRR6423020.Aligned.sortedByCoord.out.bam"
## [20] "SRR6423021.SRR6423021.Aligned.sortedByCoord.out.bam"
Looking at the column data, our sample data is mixed together with additional fields. We will need to separate those out, and then alter the names of our samples for easier usage.
I first store the original names as a backup, as well as the specific SRR codes of each sample using str_extract, then rewrite the names of readcounts according to the SRA Run Selector Table.
Lastly, I store all usefull non-sample data in a dataframe I call first_cols. I opt to leave out column 8, as I had initially attempted to return the gene_biotype, but all of my results returned NA for this field.
orig_names <- names(readcounts)
orig_SRRs <- str_extract(orig_names[-seq(1,8)], "^[^.]+")
names(readcounts) <- c(names(readcounts)[1:8],
paste("WT", "DU145", c(1:3), sep = "_"),
paste("SPINK1", "DU145", c(1:3), sep = "_"),
paste("WT", "PC3", c(1:3), sep = "_"),
paste("SPINK1", "PC3", c(1:3), sep = "_") )
colnames(readcounts)
## [1] "Geneid" "Chr" "Start" "End"
## [5] "Strand" "Length" "gene_name" "gene_biotype"
## [9] "WT_DU145_1" "WT_DU145_2" "WT_DU145_3" "SPINK1_DU145_1"
## [13] "SPINK1_DU145_2" "SPINK1_DU145_3" "WT_PC3_1" "WT_PC3_2"
## [17] "WT_PC3_3" "SPINK1_PC3_1" "SPINK1_PC3_2" "SPINK1_PC3_3"
first_cols <- readcounts[,1:7] %>% as.data.frame
I then set the names of my rows to be my Geneids, which is a little difficult to read, but will be replaced with the gene names stored in first_cols where appropriate.
## gene IDs should be stored as row.names
row.names(readcounts) <- make.names(readcounts$Geneid)
## info such as genomic coordinates)
readcounts <- readcounts[ , -seq(1:8)]
head(readcounts)
## WT_DU145_1 WT_DU145_2 WT_DU145_3 SPINK1_DU145_1
## ENSG00000223972.5 0 0 0 0
## ENSG00000227232.5 80 44 61 33
## ENSG00000278267.1 27 25 26 21
## ENSG00000243485.5 0 0 0 0
## ENSG00000284332.1 0 0 0 0
## ENSG00000237613.2 0 0 0 0
## SPINK1_DU145_2 SPINK1_DU145_3 WT_PC3_1 WT_PC3_2 WT_PC3_3
## ENSG00000223972.5 0 0 0 0 0
## ENSG00000227232.5 12 19 12 56 37
## ENSG00000278267.1 23 12 4 3 6
## ENSG00000243485.5 0 0 0 0 1
## ENSG00000284332.1 0 0 0 0 0
## ENSG00000237613.2 0 0 0 0 0
## SPINK1_PC3_1 SPINK1_PC3_2 SPINK1_PC3_3
## ENSG00000223972.5 0 0 0
## ENSG00000227232.5 41 34 49
## ENSG00000278267.1 7 7 5
## ENSG00000243485.5 0 1 1
## ENSG00000284332.1 0 0 0
## ENSG00000237613.2 0 0 0
I’ll also need to store information from my readcounts object that I can reference later. This not only specifies how my results are differentiated (WT vs. SPINK1 treatment, and DU145 vs. PC3 cell lines), but also includes the original SRR labels.
# let's use the info from our readcounts object
sample_info <- data.frame('condition' = c('WT','WT','WT','SPINK1','SPINK1','SPINK1','WT','WT','WT','SPINK1','SPINK1','SPINK1'), 'celltype' = c('DU145','DU145','DU145','DU145','DU145','DU145','PC3','PC3','PC3','PC3','PC3','PC3'), 'SRR' = orig_SRRs,
row.names = names(readcounts) )
sample_info[3]
## SRR
## WT_DU145_1 SRR6423010
## WT_DU145_2 SRR6423011
## WT_DU145_3 SRR6423012
## SPINK1_DU145_1 SRR6423013
## SPINK1_DU145_2 SRR6423014
## SPINK1_DU145_3 SRR6423015
## WT_PC3_1 SRR6423016
## WT_PC3_2 SRR6423017
## WT_PC3_3 SRR6423018
## SPINK1_PC3_1 SRR6423019
## SPINK1_PC3_2 SRR6423020
## SPINK1_PC3_3 SRR6423021
I also opted to further enrich my results by manually mapping in additional genomic features to my featurecounts data that were left out of my genereated featureCounts due to time constriants. This ultimately resulted in my conducting several forms of parallel analysis that weren’t terribly fruitful, but I’ll leave them in as they might be potential vectors for future improvement if given more time and R familiarity.
I downloaded my human GTF file, and grabbed the gene_id (analagous to gene entrezID apparently), the gene_type, and gene_name in a new dataframe. I then created a second dataframe containing my readcounts data left joined to the first. Unfortunately for me, there seemed to be multiple lines in the GTF originating dataframe, resulting in an increase in the size of the left_joined readcounts dataframe.
genes_df <-
read_gff(paste0(folder, "gencode.v39.annotation.gtf") ,
genome_info="hg38") %>%
select(gene_id, gene_type, gene_name) %>%
as.data.frame()
gene_info_df <-
data.frame(gencode_id = rownames(readcounts)) %>%
left_join(., genes_df,
by = c('gencode_id' = 'gene_id')) %>%
select(gencode_id, gene_name, gene_type) %>%
distinct()
That said, I now had the ability to better screen and process my data, as well as make some preliminary comparisions to the findings made in the paper.
sapply(gene_info_df, n_distinct)
gene_info_df %>% count(gene_type) %>% arrange(desc((n)))
While I couldn’t figure out how to properly group these fields such that there wasn’t such rampant overcounting, it’s important to note that the relative percentage of protein_coding genes compared with all others is still very representative of what was found in the paper.
Another avenue of attack I wanted to try was to filter down my genes to ensure that I only carried over protein_coding genes and non-mRNA reads. This latter step removed surprisingly few samples, and ultimately didn’t have very dramatic effect on the results of the analysis.
gene_info_df_filtered <-
gene_info_df %>%
filter(gene_type == 'protein_coding',
!str_detect(gene_name, '^MT-')) %>%
distinct()
readcountslimited <-
readcounts[gene_info_df_filtered$gencode_id, ]
subset(gene_info_df_filtered, genecode_id = rownames(readcounts)) %>% count(gene_type) %>% arrange(desc((n)))
sapply(readcountslimited, n_distinct)
sapply(gene_info_df_filtered, n_distinct)
subset(gene_info_df_filtered, genecode_id = colnames(readcountslimited)) %>% count(gene_type) %>% arrange(desc((n)))
After that, I was ready to begin creating my DESeq2 object
Some explanation is required here, as I had previously mentioned that I intended to analyze my results from multiple perspectives.
When first selecting my dataset, I was unaware of the fact that the RNASeq data was not a single condition assay, but rather two such assays done in parallel. This complicated my analysis for quite some time, as my sample had two conditions, two celltypes, and an assumed interaction term, and I had difficulty determining the best route forward. Initially I had presumed that splitting my data on the basis of record would be one way to proceed, and my initial results reflected that first attempt to process my data and generate analytical downstream results.
However, after talking with Merv several times, he pointed me in the direction of his sample of DESeq2 analysis performed on a similarly structured sample dataset (example 3). This is what ultimately lead me to create my DESeq.ds object seen immediately below:
DESeq.ds <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts),
colData = sample_info,
rowData = first_cols,
design = ~ celltype + condition + celltype:condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
However, additional experiments were performed on my part, splitting my data in a variety of ways (splitting by celltype, before and after the creation of the DESeq2 object, creating a DESeq2 object from my previously protein-coding/non-mRNA based filtering, and splitting of those samples). In retrospect, I really should have concentrated my efforts elsewhere, but you live and learn.
DESeq.ds <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts),
colData = sample_info,
rowData = first_cols,
design = ~ celltype + condition + celltype:condition)
DESeq.ds.DU <- DESeq.ds[ , DESeq.ds$celltype %in% c("DU145")]
DESeq.ds.PC <- DESeq.ds[ , DESeq.ds$celltype %in% c("PC3")]
##########
DESeq.ds.filter <- DESeqDataSetFromMatrix(countData = as.matrix(readcountslimited),
colData = sample_info,
rowData = first_cols[first_cols$Geneid %in% rownames(readcountslimited),],
design = ~ celltype + condition + celltype:condition)
DESeq.ds.DU.filter <- DESeq.ds.filter[ , DESeq.ds.filter$celltype %in% c("DU145")]
DESeq.ds.PC.filter <- DESeq.ds.filter[ , DESeq.ds.filter$celltype %in% c("PC3")]
##########
DESeq.ds.DU.split <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts[,1:6]),
colData = sample_info[1:6,],
rowData = first_cols,
design = ~ condition)
DESeq.ds.PC.split <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts[,7:12]),
colData = sample_info[7:12,],
rowData = first_cols,
design = ~ condition)
I will document the DESeq2 object processing and analysis for my DESeq.ds object, but the various steps seen here were also performed in parellel during the process of completing this assignment. The work associated with that will be kept in a separate workbook, and will be uploaded along with the data object associated with them.
Filtering Gene Counts
We start by filtering out all non-expressing genes from our object.
dim(DESeq.ds)
## [1] 61533 12
keep_genes <- rowSums(counts(DESeq.ds)) > 0
DESeq.ds <- DESeq.ds[ keep_genes, ]
dim(DESeq.ds)
## [1] 35156 12
We can then plot our newly filtered counts in a barplot:
colSums(counts(DESeq.ds)) %>% barplot(las=2)
Moving forward, we will filter our DESeqDataSet as to only include genes that have a read mapped for at least one sample.
To make sure that the comparisons we make between our read counts across different samples, we’ll have to normalize our raw counts for sequencing depth, and overall gene expression. Fortunately both of these steps can be conducted using DESeq2.
Normalizing Read Counts with Size Factors
First, we’ll start by normalizing by estimating our size factor
DESeq.ds <- estimateSizeFactors(DESeq.ds) # calculate SFs, add them to object
plot( sizeFactors(DESeq.ds), colSums(counts(DESeq.ds)), # assess them
ylab = "library sizes", xlab = "size factors", cex = .6)
text( sizeFactors(DESeq.ds), colSums(counts(DESeq.ds)), labels=colnames(DESeq.ds), pos=1, cex = 0.4)
We’re making a few assumptions here, that our size factors should be ~1, and scale with the number of reads per sample is met. While our line isn’t perfect, this seems to largely hold true for our data, which is a good sign. Additional samples would have been even better given the statistical issues that could arise from only repeating your experimental samples in triplicate, but that’s neither here nor there for now.
Note that here, due to our DESeq2 object containing two conditions and two celltypes, the boxplots are huge, making it difficult to perform this sort of analysis. Some of my work conducted in other rmd files will be recapitulated here for the sake of demonstrating some of these graphs.
DESeq.ds2 <- estimateSizeFactors(DESeq.ds[,1:6]) # calculate SFs, add them to object
DESeq.ds3 <- estimateSizeFactors(DESeq.ds[,1:6]) # calculate SFs, add them to object
par(mfrow=c(1,2))
plot( sizeFactors(DESeq.ds2), colSums(counts(DESeq.ds[,1:6])), # assess them
ylab = "library sizes", xlab = "size factors", cex = .6)
text( sizeFactors(DESeq.ds2), colSums(counts(DESeq.ds[,1:6])), labels=colnames(DESeq.ds[,1:6]), pos=1, cex = 0.4)
plot( sizeFactors(DESeq.ds3), colSums(counts(DESeq.ds[,7:12])), # assess them
ylab = "library sizes", xlab = "size factors", cex = .6)
text( sizeFactors(DESeq.ds2), colSums(counts(DESeq.ds[,7:12])), labels=colnames(DESeq.ds[,7:12]), pos=1, cex = 0.4)
Separating out our previous graph we can more clearly see that our reads are comfortably close to our estimated size factor. Our outliers also seem to be more easily identifiable, and due to the two samples with overrepresented sequences.
To further check our results, we can examine the influence of sequencing depth normalization checking the log2 counts as well:
## bp of size-factor normalized values
boxplot(log2(counts(DESeq.ds, normalize= TRUE) +1), notch=TRUE,
main = "Full Size-factor-normalized read counts",
ylab="log2(read counts)", cex = .6, las=2)
Note again that the sizable boxplots are due to the nature of my dataset. Boxplot analysis performmed on these sets of data separately can be seen in my additional rmd files.
We can check the log counts of our various results, below two are chosen from our DU145 data, comparing WT to SPINK1
## non-normalized read counts plus pseudocount
log.counts <- log2(counts(DESeq.ds, normalized = FALSE) + 1)
## instead of creating a new object, we could assign the values to a distinct matrix
## within the DESeq.ds object
assay(DESeq.ds, "log.counts") <- log2(counts(DESeq.ds, normalized = FALSE) + 1)
## normalized read counts
log.norm.counts <- log2(counts(DESeq.ds, normalized=TRUE) + 1)
assay(DESeq.ds, "log.norm.counts") <- log.norm.counts
par(mfrow=c(1,2))
DESeq.ds[, c("WT_DU145_1","WT_DU145_2")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "DU145 WT_1 vs. WT_2")
DESeq.ds[, c("WT_DU145_1","SPINK1_DU145_1")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "DU145 WT_1 vs SPINK1_1")
In comparison, here are our results comparing one of our good results from the PC3 RNA results with the two contaminated samples. Note the fanning out of points upwards on the WT_1 vs WT_2 graph.
par(mfrow=c(2,2))
DESeq.ds[, c("WT_PC3_1","WT_PC3_3")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "PC3 WT_1 vs WT_3")
DESeq.ds[, c("WT_PC3_1","WT_PC3_2")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "PC3 WT_1 vs. WT_2")
DESeq.ds[, c("WT_PC3_1","SPINK1_PC3_1")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "PC3 WT_1 vs SPINK1_1")
DESeq.ds[, c("WT_PC3_1","SPINK1_PC3_2")] %>%
assay(., "log.norm.counts") %>%
plot(., cex=.1, main = "PC3 WT_1 vs SPINK1_2")
Additionally in both celltypes, there is noticible fanning out where readcounts are low, indicating the possibility that the standard deviation of the expression levels might depend on the mean. To assess this I’ll utilize some code used in class:
## generate the base meanSdPlot using sequencing depth normalized log2(read counts)
log.norm.counts <- log2(counts(DESeq.ds, normalized=TRUE) + 1)
## set up ploting frames
par(mfrow=c(1,1))
## generate the plot
msd_plot <- vsn::meanSdPlot(log.norm.counts,
ranks=FALSE, # show the data on the original scale
plot = FALSE)
## since vsn::meanSdPlot generates a ggplot2 object, this can be
## manipulated in the usual ways
msd_plot$gg +
ggtitle("Sequencing depth normalized log2(read counts)") +
ylab("standard deviation")
As expected, when applied to the complete dataset, the mean is still relatively high. Applied separately it lowers considerably, but is still far from horizontal. To compensate for this, we’ll need to reduce the variance on the mean.
par(mfrow=c(1,2))
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
plot(assay(DESeq.rlog)[,1],
assay(DESeq.rlog)[,2],
cex=.1, main = "rlog transformed",
xlab = colnames(assay(DESeq.rlog[,1])),
ylab = colnames(assay(DESeq.rlog[,2])) )
DESeq.rlog <- rlog(DESeq.ds, blind = FALSE)
plot(assay(DESeq.rlog)[,1],
assay(DESeq.rlog)[,2],
cex=.1, main = "rlog transformed",
xlab = colnames(assay(DESeq.rlog[,1])),
ylab = colnames(assay(DESeq.rlog[,2])) )
## this actually generates a different type of object!
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE)
## set blind = FALSE if the conditions are expected to introduce
## strong differences in a large proportion of the genes
Let’s visually check the results of the rlog transformation:
par(mfrow=c(1,2))
plot(log.norm.counts[,1:2], cex=.1,
main = "size factor and log2-transformed")
## the rlog-transformed counts are stored in the accessor "assay"
plot(assay(DESeq.rlog)[,1],
assay(DESeq.rlog)[,2],
cex=.1, main = "rlog transformed",
xlab = colnames(assay(DESeq.rlog[,1])),
ylab = colnames(assay(DESeq.rlog[,2])) )
rlog.norm.counts <- assay(DESeq.rlog)
Using rlog has tightened our variance significantly
Replotting, we get:
## rlog-transformed read counts
msd_plot <- vsn::meanSdPlot( rlog.norm.counts, ranks=FALSE, plot = FALSE)
msd_plot$gg + ggtitle("Following rlog transformation") +
coord_cartesian(ylim = c(0,3))
Which is a noticible improvement, but naturally only goes so far to justifiy the utilization of this method. See the results of this analysis performed on my separated datasets below:
par(mfrow=c(1,2))
msd_plot.DU$gg + ggtitle("DU Following rlog transformation") +
coord_cartesian(ylim = c(0,3))
msd_plot.PC$gg + ggtitle("PC Following rlog transformation") +
coord_cartesian(ylim = c(0,3))
Additional transformations are also possible to implement, and produce interesting results:
library("vsn")
ntd <- normTransform(DESeq.ds)
vsd <- vst(DESeq.ds, blind=FALSE)
rld <- rlog(DESeq.ds, blind=FALSE)
ntd <- normTransform(DESeq.ds)
meanSdPlot(assay(ntd))
meanSdPlot(assay(vsd))
meanSdPlot(assay(rld))
meanSdPlot(assay(ntd))
Our strongest results are generated from our variance stabilizing transformation, and with good reason! Given the substantial variance introduced into our dataset due to it having multiple conditions and celltypes, this came closest to producing the results we were looking for.
head(assay(vsd), 3)
## WT_DU145_1 WT_DU145_2 WT_DU145_3 SPINK1_DU145_1
## ENSG00000227232.5 8.081719 7.786054 7.992472 7.636698
## ENSG00000278267.1 7.590713 7.567382 7.615938 7.485835
## ENSG00000243485.5 6.878561 6.878561 6.878561 6.878561
## SPINK1_DU145_2 SPINK1_DU145_3 WT_PC3_1 WT_PC3_2 WT_PC3_3
## ENSG00000227232.5 7.345373 7.471539 7.338545 7.825051 7.627084
## ENSG00000278267.1 7.522302 7.351022 7.144881 7.101359 7.182814
## ENSG00000243485.5 6.878561 6.878561 6.878561 6.878561 7.002964
## SPINK1_PC3_1 SPINK1_PC3_2 SPINK1_PC3_3
## ENSG00000227232.5 7.651490 7.563537 7.794414
## ENSG00000278267.1 7.201094 7.191676 7.175532
## ENSG00000243485.5 6.878561 6.997106 7.011558
Lastly, we can perform some PCA analysis
z <- plotPCA(DESeq.rlog,intgroup=c("condition", "celltype")) + coord_cartesian()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
z + geom_label(aes(label=colnames(DESeq.rlog), size=NULL), nudge_y = 0.7 , label.size=0.25)
Note again the PC3 outlier far removed from the others. The variance of this graph almost completely captured by the difference in celltype, which makes perfect sense. When drilling down into that PCA specifically, we can also see that when considering the PCA for that set alone, that difference accounts for 22% of all total variance:
plotPCA(DESeq.rlog.PC,intgroup=c("condition", "celltype")) + coord_cartesian()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Now we move on to the differential expression analyisis. This process is covered in more detail (and for the additional DESeq2 objects) in other rmds stored for the purose, but the essential steps for the generation of my results will be recapitulated here.
Note that if given more time, it would have probably been a good idea to go back and reconsider some assumptions made about my data prior to running DESeq. DESeq itself runs estimateSizeFators(), estimateDispersions(), and the nbionomWaldTest() on your input, and some further insights might have been gained by testing out these various functions separately and with alternate parameters set. C’est la vie, non?
Instead, I merely ran the DESeq, seen below:
DESeq.ds <- DESeq(DESeq.ds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
DESeq.ds.DU$condition <- relevel(DESeq.ds.DU$condition, ref="WT")
DESeq.ds.DU$condition
## [1] WT WT WT SPINK1 SPINK1 SPINK1
## Levels: WT SPINK1
DESeq.ds.DU.split <- DESeq(DESeq.ds.DU.split)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
DESeq.ds.PC.split <- DESeq(DESeq.ds.PC.split)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
The various altered DESeq objects can be seen below:
DESeq.ds
## class: DESeqDataSet
## dim: 35156 12
## metadata(1): version
## assays(6): counts log.counts ... H cooks
## rownames(35156): ENSG00000227232.5 ENSG00000278267.1 ...
## ENSG00000210195.2 ENSG00000210196.2
## rowData names(37): Geneid Chr ... deviance maxCooks
## colnames(12): WT_DU145_1 WT_DU145_2 ... SPINK1_PC3_2 SPINK1_PC3_3
## colData names(4): condition celltype SRR sizeFactor
DESeq.ds.DU.split
## class: DESeqDataSet
## dim: 61533 6
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(61533): ENSG00000223972.5 ENSG00000227232.5 ...
## ENSG00000210195.2 ENSG00000210196.2
## rowData names(29): Geneid Chr ... deviance maxCooks
## colnames(6): WT_DU145_1 WT_DU145_2 ... SPINK1_DU145_2 SPINK1_DU145_3
## colData names(4): condition celltype SRR sizeFactor
DESeq.ds.PC.split
## class: DESeqDataSet
## dim: 61533 6
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(61533): ENSG00000223972.5 ENSG00000227232.5 ...
## ENSG00000210195.2 ENSG00000210196.2
## rowData names(29): Geneid Chr ... deviance maxCooks
## colnames(6): WT_PC3_1 WT_PC3_2 ... SPINK1_PC3_2 SPINK1_PC3_3
## colData names(4): condition celltype SRR sizeFactor
Given our deeply sequenced data (and the nature of RNA-seq data in general), several steps need to be taken in order to be able to make statistically significant inferences about our data, and generate robust p-values For RNA-seq, the argument has been made that genes with very low read counts can be ignored for downstream analyses as their read counts are often too unreliable and variable to be accurately assessed with only 3-5. As I mentioned before, this is a very unfortunate situation for me given that I chose this dataset SPECIFICALLY because I assumed it was two sets of 6, rather than 4 sets of 3. However, all is not lost. We need to find ways to remove genes that would have a low chance of being identified as statistically significant and remove them prior to further analysis.
We can start this process by using results() to extract the base means across samples. This code can be seen below:
DGE.results <- results(DESeq.ds, independentFiltering = TRUE, alpha = 0.05)
DGE.results.DU.split <- results(DESeq.ds.DU.split, independentFiltering = TRUE, alpha = 0.05)
DGE.results.PC.split <- results(DESeq.ds.PC.split, independentFiltering = TRUE, alpha = 0.05)
resultsNames(DESeq.ds)
## [1] "Intercept" "celltype_PC3_vs_DU145"
## [3] "condition_WT_vs_SPINK1" "celltypePC3.conditionWT"
Note that in the case of DESeq.ds, we have additional comparisons to uuse in order to derive meaningful comparisons. This is of course because for my non-separated datasets, we have two conditions and two celltypes.
Looking at our results:
head(DGE.results)
## log2 fold change (MLE): celltypePC3.conditionWT
## Wald test p-value: celltypePC3.conditionWT
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000227232.5 40.161170 -1.840692 0.623193 -2.953644 0.00314046
## ENSG00000278267.1 14.523120 -1.049743 0.736521 -1.425271 0.15407880
## ENSG00000243485.5 0.222587 -0.950574 5.847527 -0.162560 0.87086491
## ENSG00000238009.6 4.906635 -1.169493 1.260281 -0.927962 0.35342727
## ENSG00000268903.1 95.537621 -0.900403 0.313581 -2.871354 0.00408718
## ENSG00000269981.1 107.598001 -0.603616 0.314291 -1.920566 0.05478639
## padj
## <numeric>
## ENSG00000227232.5 0.0110452
## ENSG00000278267.1 0.2798528
## ENSG00000243485.5 NA
## ENSG00000238009.6 0.5175143
## ENSG00000268903.1 0.0138450
## ENSG00000269981.1 0.1232036
head(DGE.results.DU.split)
## log2 fold change (MLE): condition WT vs SPINK1
## Wald test p-value: condition WT vs SPINK1
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000223972.5 0.0000 NA NA NA NA
## ENSG00000227232.5 42.1171 1.65518 0.359393 4.60549 4.1150e-06
## ENSG00000278267.1 22.3824 0.60248 0.412833 1.45938 1.4446e-01
## ENSG00000243485.5 0.0000 NA NA NA NA
## ENSG00000284332.1 0.0000 NA NA NA NA
## ENSG00000237613.2 0.0000 NA NA NA NA
## padj
## <numeric>
## ENSG00000223972.5 NA
## ENSG00000227232.5 2.36189e-05
## ENSG00000278267.1 2.55515e-01
## ENSG00000243485.5 NA
## ENSG00000284332.1 NA
## ENSG00000237613.2 NA
head(DGE.results.PC.split)
## log2 fold change (MLE): condition WT vs SPINK1
## Wald test p-value: condition WT vs SPINK1
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000223972.5 0.000000 NA NA NA NA
## ENSG00000227232.5 37.901379 -0.190574 0.463464 -0.411195 0.680929
## ENSG00000278267.1 5.224447 -0.451561 0.884889 -0.510303 0.609839
## ENSG00000243485.5 0.489413 -0.829878 3.023198 -0.274503 0.783698
## ENSG00000284332.1 0.000000 NA NA NA NA
## ENSG00000237613.2 0.000000 NA NA NA NA
## padj
## <numeric>
## ENSG00000223972.5 NA
## ENSG00000227232.5 0.80969
## ENSG00000278267.1 NA
## ENSG00000243485.5 NA
## ENSG00000284332.1 NA
## ENSG00000237613.2 NA
# the DESeqResult object can basically be handled like a data.frame
table(DGE.results$padj < 0.05)
##
## FALSE TRUE
## 11840 6906
table(DGE.results.DU.split$padj < 0.05)
##
## FALSE TRUE
## 11282 7527
table(DGE.results.PC.split$padj < 0.05)
##
## FALSE TRUE
## 12064 5034
From this we can see that we have knocked out several unnecessary values, while still retaining a respectable number of genes to work with.
Plotting histograms of our adjusted p_values shows a respectable ratio of low adjusted p-values for our genes.
hist(DGE.results$padj,
col="grey", border="white", xlab="", ylab="",
main="frequencies of adj. p-values\n(all genes)",
cex = 0.4)
hist(DGE.results.DU.split$padj,
col="blue", border="white", xlab="", ylab="",
main="frequencies of adj. p-values\n(all genes)",
cex = 0.4)
hist(DGE.results.PC.split$padj,
col="red", border="white", xlab="", ylab="",
main="frequencies of adj. p-values\n(all genes)",
cex = 0.4)
This can be further demonstrated using heatmaps, which are able to better visualize this data with respect to the originating samples:
# identify genes with the desired adjusted p-value cut-off
DGEgenes <- rownames(subset(DGE.results.sorted, padj < 0.05))
DGEgenes.DU.split <- rownames(subset(DGE.results.sorted.DU.split, padj < 0.05))
DGEgenes.PC.split <- rownames(subset(DGE.results.sorted.PC.split, padj < 0.05))
# extract rlog-transformed values into a matrix
rlog.dge <- DESeq.rlog[DGEgenes,] %>% assay
rlog.dge.DU.split <- DESeq.rlog.DU.split[DGEgenes.DU.split,] %>% assay
rlog.dge.PC.split <- DESeq.rlog.PC.split[DGEgenes.PC.split,] %>% assay
library(pheatmap)
# heatmap of DEG sorted by p.adjust
pheatmap(rlog.dge, scale="none",
show_rownames = FALSE, main = "DGE (no scaling)",
color = colorRampPalette(RColorBrewer::brewer.pal(n = 7, name = "Reds"))(100))
pheatmap(rlog.dge.DU.split, scale="none",
show_rownames = FALSE, main = "DGE (no scaling)",
color = colorRampPalette(RColorBrewer::brewer.pal(n = 7, name = "Reds"))(100))
pheatmap(rlog.dge.PC.split, scale="none",
show_rownames = FALSE, main = "DGE (no scaling)",
color = colorRampPalette(RColorBrewer::brewer.pal(n = 7, name = "Reds"))(100))
A clearer (and more colorful) picture can be painted if we scale our data.
pheatmap(rlog.dge, scale="row",
show_rownames = FALSE, main = "DGE (row-based z-score)")
Shown when controlling for celltype makes the results even more dramatic:
par(mfrow=c(1,2))
pheatmap(rlog.dge[,1:6], scale="row",
show_rownames = FALSE, main = "DU145 DGE (row-based z-score)")
pheatmap(rlog.dge[,7:12], scale="row",
show_rownames = FALSE, main = "PC3 DGE (row-based z-score)")
And again, the same analysis is performed on the split dataset
pheatmap(rlog.dge.DU.split, scale="row",
show_rownames = FALSE, main = "DU145 DGE (row-based z-score)")
pheatmap(rlog.dge.PC.split, scale="row",
show_rownames = FALSE, main = "PC3 DGE (row-based z-score)")
Though most colorfully of all, we can present a headmap with respect to both the celltype and condition simultaneously:
select <- order(rowMeans(counts(DESeq.ds,normalized=TRUE)),
decreasing=TRUE)[1:20]
#df <- as.data.frame(colData(DESeq.ds)[,c("condition","celltype")])
#pheatmap(assay(ntd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
# cluster_cols=FALSE, annotation_col=df)
pheatmap(assay(vsd)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
pheatmap(assay(rld)[select,], cluster_rows=FALSE, show_rownames=FALSE,
cluster_cols=FALSE, annotation_col=df)
Genes Found to be Differentially Expressed
To translate this data into more gene-specific format, we will need to examine them not in bulk, but individually. By assessing their individual p-scores, generated from a set of results where we take into account the effects that the condition and celltype might have on one another, and contrast specifically the condition, we can emerge with a set of genes that exhibit the most dramatic effects of up and down regulation. Thanks to Merv for providing the setup for this code, which I am using due to time constraints.
MIN_FC=2
MIN_L2FC=log2(MIN_FC)
MAX_QVAL=0.05
N_TOP_GENES=25
FOI="condition"
LEVEL_0="SPINK1"
LEVEL_1="WT"
dsr <- results(DESeq.ds, saveCols=c("Geneid", "gene_name"),
contrast=c(FOI, LEVEL_1, LEVEL_0),
independentFiltering=TRUE) %>%
as_tibble() %>%
select(Geneid, gene_name, everything()) %>%
arrange(pvalue)
dsr %>%
filter(log2FoldChange >= MIN_L2FC) %>%
head(n=N_TOP_GENES) %>%
knitr::kable()
| Geneid | gene_name | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|---|
| ENSG00000112096.19 | SOD2 | 4321.1657 | 2.079657 | 0.0657574 | 31.62619 | 0 | 0 |
| ENSG00000101846.9 | STS | 826.9620 | 2.720844 | 0.0889568 | 30.58614 | 0 | 0 |
| ENSG00000187837.4 | H1-2 | 3935.7541 | 1.888783 | 0.0638229 | 29.59414 | 0 | 0 |
| ENSG00000070610.14 | GBA2 | 2461.9509 | 1.774416 | 0.0766159 | 23.15990 | 0 | 0 |
| ENSG00000161011.20 | SQSTM1 | 13279.3559 | 1.210689 | 0.0533445 | 22.69570 | 0 | 0 |
| ENSG00000100906.11 | NFKBIA | 5402.5796 | 1.149527 | 0.0508256 | 22.61709 | 0 | 0 |
| ENSG00000131069.20 | ACSS2 | 6818.8524 | 1.291155 | 0.0587479 | 21.97789 | 0 | 0 |
| ENSG00000234745.13 | HLA-B | 7069.6583 | 1.229550 | 0.0564918 | 21.76512 | 0 | 0 |
| ENSG00000184678.11 | H2BC21 | 1071.0992 | 1.605273 | 0.0739799 | 21.69875 | 0 | 0 |
| ENSG00000172893.18 | DHCR7 | 8449.2750 | 1.158587 | 0.0538409 | 21.51873 | 0 | 0 |
| ENSG00000180573.10 | H2AC6 | 1473.7340 | 1.764880 | 0.0822612 | 21.45460 | 0 | 0 |
| ENSG00000162804.14 | SNED1 | 394.3964 | 3.012808 | 0.1411035 | 21.35176 | 0 | 0 |
| ENSG00000158406.6 | H4C8 | 692.0241 | 2.415821 | 0.1135602 | 21.27347 | 0 | 0 |
| ENSG00000099194.6 | SCD | 37572.2804 | 1.306388 | 0.0634102 | 20.60218 | 0 | 0 |
| ENSG00000105339.11 | DENND3 | 1043.7403 | 1.451382 | 0.0704907 | 20.58969 | 0 | 0 |
| ENSG00000158373.9 | H2BC5 | 631.0605 | 1.862091 | 0.0915798 | 20.33298 | 0 | 0 |
| ENSG00000137642.13 | SORL1 | 2857.3857 | 1.223134 | 0.0631875 | 19.35723 | 0 | 0 |
| ENSG00000134824.14 | FADS2 | 17275.7960 | 1.139829 | 0.0606336 | 18.79864 | 0 | 0 |
| ENSG00000204386.12 | NEU1 | 1864.4644 | 1.282594 | 0.0684657 | 18.73337 | 0 | 0 |
| ENSG00000113916.18 | BCL6 | 574.4218 | 1.753265 | 0.0956102 | 18.33764 | 0 | 0 |
| ENSG00000125826.21 | RBCK1 | 5982.2478 | 1.734707 | 0.0960928 | 18.05242 | 0 | 0 |
| ENSG00000075223.14 | SEMA3C | 3295.4722 | 1.122806 | 0.0622329 | 18.04200 | 0 | 0 |
| ENSG00000198888.2 | MT-ND1 | 31816.1881 | 1.140324 | 0.0639381 | 17.83481 | 0 | 0 |
| ENSG00000126709.16 | IFI6 | 2385.4880 | 2.143352 | 0.1204181 | 17.79925 | 0 | 0 |
| ENSG00000167508.12 | MVD | 2848.0057 | 1.494505 | 0.0842575 | 17.73735 | 0 | 0 |
dsr %>%
filter(log2FoldChange <= -MIN_L2FC) %>%
head(n=N_TOP_GENES) %>%
knitr::kable()
| Geneid | gene_name | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|---|
| ENSG00000187908.20 | DMBT1 | 507.6636 | -2.629646 | 0.1035662 | -25.39096 | 0 | 0 |
| ENSG00000164611.13 | PTTG1 | 1162.0332 | -2.031179 | 0.0836420 | -24.28419 | 0 | 0 |
| ENSG00000089685.15 | BIRC5 | 1933.0627 | -2.260010 | 0.0940635 | -24.02643 | 0 | 0 |
| ENSG00000147689.17 | FAM83A | 5063.4695 | -2.240927 | 0.0985256 | -22.74461 | 0 | 0 |
| ENSG00000169679.15 | BUB1 | 1500.5314 | -2.558618 | 0.1159074 | -22.07467 | 0 | 0 |
| ENSG00000100526.21 | CDKN3 | 971.6392 | -2.028783 | 0.0925965 | -21.90992 | 0 | 0 |
| ENSG00000134057.15 | CCNB1 | 2373.3330 | -2.225994 | 0.1033099 | -21.54677 | 0 | 0 |
| ENSG00000068489.13 | PRR11 | 1750.1396 | -2.316519 | 0.1087562 | -21.30012 | 0 | 0 |
| ENSG00000117399.14 | CDC20 | 1834.1577 | -2.534666 | 0.1194623 | -21.21729 | 0 | 0 |
| ENSG00000109971.14 | HSPA8 | 25437.1401 | -1.055817 | 0.0540125 | -19.54765 | 0 | 0 |
| ENSG00000108602.18 | ALDH3A1 | 431.5669 | -2.041205 | 0.1046662 | -19.50204 | 0 | 0 |
| ENSG00000087586.18 | AURKA | 1872.2675 | -2.144478 | 0.1109483 | -19.32862 | 0 | 0 |
| ENSG00000126787.13 | DLGAP5 | 1753.5930 | -2.683979 | 0.1394519 | -19.24664 | 0 | 0 |
| ENSG00000240476.1 | LINC00973 | 522.3614 | -1.895641 | 0.0988802 | -19.17109 | 0 | 0 |
| ENSG00000080824.19 | HSP90AA1 | 36155.1540 | -1.308070 | 0.0690522 | -18.94321 | 0 | 0 |
| ENSG00000075340.23 | ADD2 | 1480.0785 | -1.667556 | 0.0884478 | -18.85355 | 0 | 0 |
| ENSG00000166851.15 | PLK1 | 1053.5076 | -2.086683 | 0.1120850 | -18.61697 | 0 | 0 |
| ENSG00000170312.17 | CDK1 | 1975.6485 | -2.261194 | 0.1225361 | -18.45328 | 0 | 0 |
| ENSG00000145386.11 | CCNA2 | 690.5967 | -2.305740 | 0.1268437 | -18.17780 | 0 | 0 |
| ENSG00000091436.17 | MAP3K20 | 2183.2566 | -1.279376 | 0.0704700 | -18.15489 | 0 | 0 |
| ENSG00000169607.13 | CKAP2L | 677.1582 | -2.556814 | 0.1410897 | -18.12191 | 0 | 0 |
| ENSG00000113368.12 | LMNB1 | 1328.9317 | -1.644479 | 0.0915160 | -17.96931 | 0 | 0 |
| ENSG00000157456.8 | CCNB2 | 930.2537 | -2.468799 | 0.1378747 | -17.90610 | 0 | 0 |
| ENSG00000184992.13 | BRI3BP | 1199.6077 | -1.414575 | 0.0790957 | -17.88436 | 0 | 0 |
| ENSG00000121957.15 | GPSM2 | 1225.4696 | -1.801299 | 0.1010701 | -17.82226 | 0 | 0 |
n_genes_up <- dsr %>%
filter(padj < MAX_QVAL, log2FoldChange >= MIN_L2FC) %>%
nrow()
n_genes_down <- dsr %>%
filter(padj < MAX_QVAL, log2FoldChange <= -MIN_L2FC) %>%
nrow()
dsr <- results(DESeq.ds[,1:6], saveCols=c("Geneid", "gene_name"),
contrast=c(FOI, LEVEL_1, LEVEL_0),
independentFiltering=TRUE) %>%
as_tibble() %>%
select(Geneid, gene_name, everything()) %>%
arrange(pvalue)
dsr %>%
filter(log2FoldChange >= MIN_L2FC) %>%
head(n=N_TOP_GENES) %>%
knitr::kable()
| Geneid | gene_name | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|---|
| ENSG00000112096.19 | SOD2 | 4321.1657 | 2.079657 | 0.0657574 | 31.62619 | 0 | 0 |
| ENSG00000101846.9 | STS | 826.9620 | 2.720844 | 0.0889568 | 30.58614 | 0 | 0 |
| ENSG00000187837.4 | H1-2 | 3935.7541 | 1.888783 | 0.0638229 | 29.59414 | 0 | 0 |
| ENSG00000070610.14 | GBA2 | 2461.9509 | 1.774416 | 0.0766159 | 23.15990 | 0 | 0 |
| ENSG00000161011.20 | SQSTM1 | 13279.3559 | 1.210689 | 0.0533445 | 22.69570 | 0 | 0 |
| ENSG00000100906.11 | NFKBIA | 5402.5796 | 1.149527 | 0.0508256 | 22.61709 | 0 | 0 |
| ENSG00000131069.20 | ACSS2 | 6818.8524 | 1.291155 | 0.0587479 | 21.97789 | 0 | 0 |
| ENSG00000234745.13 | HLA-B | 7069.6583 | 1.229550 | 0.0564918 | 21.76512 | 0 | 0 |
| ENSG00000184678.11 | H2BC21 | 1071.0992 | 1.605273 | 0.0739799 | 21.69875 | 0 | 0 |
| ENSG00000172893.18 | DHCR7 | 8449.2750 | 1.158587 | 0.0538409 | 21.51873 | 0 | 0 |
| ENSG00000180573.10 | H2AC6 | 1473.7340 | 1.764880 | 0.0822612 | 21.45460 | 0 | 0 |
| ENSG00000162804.14 | SNED1 | 394.3964 | 3.012808 | 0.1411035 | 21.35176 | 0 | 0 |
| ENSG00000158406.6 | H4C8 | 692.0241 | 2.415821 | 0.1135602 | 21.27347 | 0 | 0 |
| ENSG00000099194.6 | SCD | 37572.2804 | 1.306388 | 0.0634102 | 20.60218 | 0 | 0 |
| ENSG00000105339.11 | DENND3 | 1043.7403 | 1.451382 | 0.0704907 | 20.58969 | 0 | 0 |
| ENSG00000158373.9 | H2BC5 | 631.0605 | 1.862091 | 0.0915798 | 20.33298 | 0 | 0 |
| ENSG00000137642.13 | SORL1 | 2857.3857 | 1.223134 | 0.0631875 | 19.35723 | 0 | 0 |
| ENSG00000134824.14 | FADS2 | 17275.7960 | 1.139829 | 0.0606336 | 18.79864 | 0 | 0 |
| ENSG00000204386.12 | NEU1 | 1864.4644 | 1.282594 | 0.0684657 | 18.73337 | 0 | 0 |
| ENSG00000113916.18 | BCL6 | 574.4218 | 1.753265 | 0.0956102 | 18.33764 | 0 | 0 |
| ENSG00000125826.21 | RBCK1 | 5982.2478 | 1.734707 | 0.0960928 | 18.05242 | 0 | 0 |
| ENSG00000075223.14 | SEMA3C | 3295.4722 | 1.122806 | 0.0622329 | 18.04200 | 0 | 0 |
| ENSG00000198888.2 | MT-ND1 | 31816.1881 | 1.140324 | 0.0639381 | 17.83481 | 0 | 0 |
| ENSG00000126709.16 | IFI6 | 2385.4880 | 2.143352 | 0.1204181 | 17.79925 | 0 | 0 |
| ENSG00000167508.12 | MVD | 2848.0057 | 1.494505 | 0.0842575 | 17.73735 | 0 | 0 |
dsr %>%
filter(log2FoldChange <= -MIN_L2FC) %>%
head(n=N_TOP_GENES) %>%
knitr::kable()
| Geneid | gene_name | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|---|
| ENSG00000187908.20 | DMBT1 | 507.6636 | -2.629646 | 0.1035662 | -25.39096 | 0 | 0 |
| ENSG00000164611.13 | PTTG1 | 1162.0332 | -2.031179 | 0.0836420 | -24.28419 | 0 | 0 |
| ENSG00000089685.15 | BIRC5 | 1933.0627 | -2.260010 | 0.0940635 | -24.02643 | 0 | 0 |
| ENSG00000147689.17 | FAM83A | 5063.4695 | -2.240927 | 0.0985256 | -22.74461 | 0 | 0 |
| ENSG00000169679.15 | BUB1 | 1500.5314 | -2.558618 | 0.1159074 | -22.07467 | 0 | 0 |
| ENSG00000100526.21 | CDKN3 | 971.6392 | -2.028783 | 0.0925965 | -21.90992 | 0 | 0 |
| ENSG00000134057.15 | CCNB1 | 2373.3330 | -2.225994 | 0.1033099 | -21.54677 | 0 | 0 |
| ENSG00000068489.13 | PRR11 | 1750.1396 | -2.316519 | 0.1087562 | -21.30012 | 0 | 0 |
| ENSG00000117399.14 | CDC20 | 1834.1577 | -2.534666 | 0.1194623 | -21.21729 | 0 | 0 |
| ENSG00000109971.14 | HSPA8 | 25437.1401 | -1.055817 | 0.0540125 | -19.54765 | 0 | 0 |
| ENSG00000108602.18 | ALDH3A1 | 431.5669 | -2.041205 | 0.1046662 | -19.50204 | 0 | 0 |
| ENSG00000087586.18 | AURKA | 1872.2675 | -2.144478 | 0.1109483 | -19.32862 | 0 | 0 |
| ENSG00000126787.13 | DLGAP5 | 1753.5930 | -2.683979 | 0.1394519 | -19.24664 | 0 | 0 |
| ENSG00000240476.1 | LINC00973 | 522.3614 | -1.895641 | 0.0988802 | -19.17109 | 0 | 0 |
| ENSG00000080824.19 | HSP90AA1 | 36155.1540 | -1.308070 | 0.0690522 | -18.94321 | 0 | 0 |
| ENSG00000075340.23 | ADD2 | 1480.0785 | -1.667556 | 0.0884478 | -18.85355 | 0 | 0 |
| ENSG00000166851.15 | PLK1 | 1053.5076 | -2.086683 | 0.1120850 | -18.61697 | 0 | 0 |
| ENSG00000170312.17 | CDK1 | 1975.6485 | -2.261194 | 0.1225361 | -18.45328 | 0 | 0 |
| ENSG00000145386.11 | CCNA2 | 690.5967 | -2.305740 | 0.1268437 | -18.17780 | 0 | 0 |
| ENSG00000091436.17 | MAP3K20 | 2183.2566 | -1.279376 | 0.0704700 | -18.15489 | 0 | 0 |
| ENSG00000169607.13 | CKAP2L | 677.1582 | -2.556814 | 0.1410897 | -18.12191 | 0 | 0 |
| ENSG00000113368.12 | LMNB1 | 1328.9317 | -1.644479 | 0.0915160 | -17.96931 | 0 | 0 |
| ENSG00000157456.8 | CCNB2 | 930.2537 | -2.468799 | 0.1378747 | -17.90610 | 0 | 0 |
| ENSG00000184992.13 | BRI3BP | 1199.6077 | -1.414575 | 0.0790957 | -17.88436 | 0 | 0 |
| ENSG00000121957.15 | GPSM2 | 1225.4696 | -1.801299 | 0.1010701 | -17.82226 | 0 | 0 |
n_genes_up_DU145 <- dsr %>%
filter(padj < MAX_QVAL, log2FoldChange >= MIN_L2FC) %>%
nrow()
n_genes_down_DU145 <- dsr %>%
filter(padj < MAX_QVAL, log2FoldChange <= -MIN_L2FC) %>%
nrow()
dsr <- results(DESeq.ds[,7:12], saveCols=c("Geneid", "gene_name"),
contrast=c(FOI, LEVEL_1, LEVEL_0),
independentFiltering=TRUE) %>%
as_tibble() %>%
select(Geneid, gene_name, everything()) %>%
arrange(pvalue)
dsr %>%
filter(log2FoldChange >= MIN_L2FC) %>%
head(n=N_TOP_GENES) %>%
knitr::kable()
| Geneid | gene_name | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|---|
| ENSG00000112096.19 | SOD2 | 4321.1657 | 2.079657 | 0.0657574 | 31.62619 | 0 | 0 |
| ENSG00000101846.9 | STS | 826.9620 | 2.720844 | 0.0889568 | 30.58614 | 0 | 0 |
| ENSG00000187837.4 | H1-2 | 3935.7541 | 1.888783 | 0.0638229 | 29.59414 | 0 | 0 |
| ENSG00000070610.14 | GBA2 | 2461.9509 | 1.774416 | 0.0766159 | 23.15990 | 0 | 0 |
| ENSG00000161011.20 | SQSTM1 | 13279.3559 | 1.210689 | 0.0533445 | 22.69570 | 0 | 0 |
| ENSG00000100906.11 | NFKBIA | 5402.5796 | 1.149527 | 0.0508256 | 22.61709 | 0 | 0 |
| ENSG00000131069.20 | ACSS2 | 6818.8524 | 1.291155 | 0.0587479 | 21.97789 | 0 | 0 |
| ENSG00000234745.13 | HLA-B | 7069.6583 | 1.229550 | 0.0564918 | 21.76512 | 0 | 0 |
| ENSG00000184678.11 | H2BC21 | 1071.0992 | 1.605273 | 0.0739799 | 21.69875 | 0 | 0 |
| ENSG00000172893.18 | DHCR7 | 8449.2750 | 1.158587 | 0.0538409 | 21.51873 | 0 | 0 |
| ENSG00000180573.10 | H2AC6 | 1473.7340 | 1.764880 | 0.0822612 | 21.45460 | 0 | 0 |
| ENSG00000162804.14 | SNED1 | 394.3964 | 3.012808 | 0.1411035 | 21.35176 | 0 | 0 |
| ENSG00000158406.6 | H4C8 | 692.0241 | 2.415821 | 0.1135602 | 21.27347 | 0 | 0 |
| ENSG00000099194.6 | SCD | 37572.2804 | 1.306388 | 0.0634102 | 20.60218 | 0 | 0 |
| ENSG00000105339.11 | DENND3 | 1043.7403 | 1.451382 | 0.0704907 | 20.58969 | 0 | 0 |
| ENSG00000158373.9 | H2BC5 | 631.0605 | 1.862091 | 0.0915798 | 20.33298 | 0 | 0 |
| ENSG00000137642.13 | SORL1 | 2857.3857 | 1.223134 | 0.0631875 | 19.35723 | 0 | 0 |
| ENSG00000134824.14 | FADS2 | 17275.7960 | 1.139829 | 0.0606336 | 18.79864 | 0 | 0 |
| ENSG00000204386.12 | NEU1 | 1864.4644 | 1.282594 | 0.0684657 | 18.73337 | 0 | 0 |
| ENSG00000113916.18 | BCL6 | 574.4218 | 1.753265 | 0.0956102 | 18.33764 | 0 | 0 |
| ENSG00000125826.21 | RBCK1 | 5982.2478 | 1.734707 | 0.0960928 | 18.05242 | 0 | 0 |
| ENSG00000075223.14 | SEMA3C | 3295.4722 | 1.122806 | 0.0622329 | 18.04200 | 0 | 0 |
| ENSG00000198888.2 | MT-ND1 | 31816.1881 | 1.140324 | 0.0639381 | 17.83481 | 0 | 0 |
| ENSG00000126709.16 | IFI6 | 2385.4880 | 2.143352 | 0.1204181 | 17.79925 | 0 | 0 |
| ENSG00000167508.12 | MVD | 2848.0057 | 1.494505 | 0.0842575 | 17.73735 | 0 | 0 |
dsr %>%
filter(log2FoldChange <= -MIN_L2FC) %>%
head(n=N_TOP_GENES) %>%
knitr::kable()
| Geneid | gene_name | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|---|
| ENSG00000187908.20 | DMBT1 | 507.6636 | -2.629646 | 0.1035662 | -25.39096 | 0 | 0 |
| ENSG00000164611.13 | PTTG1 | 1162.0332 | -2.031179 | 0.0836420 | -24.28419 | 0 | 0 |
| ENSG00000089685.15 | BIRC5 | 1933.0627 | -2.260010 | 0.0940635 | -24.02643 | 0 | 0 |
| ENSG00000147689.17 | FAM83A | 5063.4695 | -2.240927 | 0.0985256 | -22.74461 | 0 | 0 |
| ENSG00000169679.15 | BUB1 | 1500.5314 | -2.558618 | 0.1159074 | -22.07467 | 0 | 0 |
| ENSG00000100526.21 | CDKN3 | 971.6392 | -2.028783 | 0.0925965 | -21.90992 | 0 | 0 |
| ENSG00000134057.15 | CCNB1 | 2373.3330 | -2.225994 | 0.1033099 | -21.54677 | 0 | 0 |
| ENSG00000068489.13 | PRR11 | 1750.1396 | -2.316519 | 0.1087562 | -21.30012 | 0 | 0 |
| ENSG00000117399.14 | CDC20 | 1834.1577 | -2.534666 | 0.1194623 | -21.21729 | 0 | 0 |
| ENSG00000109971.14 | HSPA8 | 25437.1401 | -1.055817 | 0.0540125 | -19.54765 | 0 | 0 |
| ENSG00000108602.18 | ALDH3A1 | 431.5669 | -2.041205 | 0.1046662 | -19.50204 | 0 | 0 |
| ENSG00000087586.18 | AURKA | 1872.2675 | -2.144478 | 0.1109483 | -19.32862 | 0 | 0 |
| ENSG00000126787.13 | DLGAP5 | 1753.5930 | -2.683979 | 0.1394519 | -19.24664 | 0 | 0 |
| ENSG00000240476.1 | LINC00973 | 522.3614 | -1.895641 | 0.0988802 | -19.17109 | 0 | 0 |
| ENSG00000080824.19 | HSP90AA1 | 36155.1540 | -1.308070 | 0.0690522 | -18.94321 | 0 | 0 |
| ENSG00000075340.23 | ADD2 | 1480.0785 | -1.667556 | 0.0884478 | -18.85355 | 0 | 0 |
| ENSG00000166851.15 | PLK1 | 1053.5076 | -2.086683 | 0.1120850 | -18.61697 | 0 | 0 |
| ENSG00000170312.17 | CDK1 | 1975.6485 | -2.261194 | 0.1225361 | -18.45328 | 0 | 0 |
| ENSG00000145386.11 | CCNA2 | 690.5967 | -2.305740 | 0.1268437 | -18.17780 | 0 | 0 |
| ENSG00000091436.17 | MAP3K20 | 2183.2566 | -1.279376 | 0.0704700 | -18.15489 | 0 | 0 |
| ENSG00000169607.13 | CKAP2L | 677.1582 | -2.556814 | 0.1410897 | -18.12191 | 0 | 0 |
| ENSG00000113368.12 | LMNB1 | 1328.9317 | -1.644479 | 0.0915160 | -17.96931 | 0 | 0 |
| ENSG00000157456.8 | CCNB2 | 930.2537 | -2.468799 | 0.1378747 | -17.90610 | 0 | 0 |
| ENSG00000184992.13 | BRI3BP | 1199.6077 | -1.414575 | 0.0790957 | -17.88436 | 0 | 0 |
| ENSG00000121957.15 | GPSM2 | 1225.4696 | -1.801299 | 0.1010701 | -17.82226 | 0 | 0 |
n_genes_up_PC3 <- dsr %>%
filter(padj < MAX_QVAL, log2FoldChange >= MIN_L2FC) %>%
nrow()
n_genes_down_PC3 <- dsr %>%
filter(padj < MAX_QVAL, log2FoldChange <= -MIN_L2FC) %>%
nrow()
Additionally, an assessment of the magnitude of the changes observed can also be important for the interpetation of experimental data. Generally speaking, more dramatic changes are more obviously significant to your analysis, and are easier to trust when assessing how findings should impact your hypothesis.
plotMA(DGE.results, alpha = 0.05,
main = "Test: p.adj.value < 0.05", ylim = c(-4,4))
plotMA(DGE.results.DU.split, alpha = 0.05,
main = "DU Test: p.adj.value < 0.05", ylim = c(-4,4))
plotMA(DGE.results.PC.split, alpha = 0.05,
main = "PC Test: p.adj.value < 0.05", ylim = c(-4,4))
As can be seen in the results of the three MA-plots, there are many exceedingly significant p-values observed in all three datasets, with some going beyond 4 log fold change. In general, fewer of these are observed from our PC3 results, but they are also fewer in number, which is to be expected from the experimental findings from Sun et al.
#BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
## Loading required package: ggrepel
vp1 <- EnhancedVolcano(DGE.results,
lab = first_cols[first_cols[,1] %in% rownames(DGE.results),7],
x = 'log2FoldChange',
y = 'padj', pCutoff = 0.05,
title = "WT / SPINK")
print(vp1)
#BiocManager::install("EnhancedVolcano")
vp1.DU <- EnhancedVolcano(DGE.results.DU.split,
lab = first_cols[first_cols[,1] %in% rownames(DGE.results.DU.split),7],
x = 'log2FoldChange',
y = 'padj', pCutoff = 0.05,
title = "DU145 - WT / SPINK")
## Warning: One or more p-values is 0. Converting to 10^-1 * current lowest non-
## zero p-value...
print(vp1.DU)
#BiocManager::install("EnhancedVolcano")
vp1.PC <- EnhancedVolcano(DGE.results.PC.split,
lab = first_cols[first_cols[,1] %in% rownames(DGE.results.PC.split),7],
x = 'log2FoldChange',
y = 'padj', pCutoff = 0.05,
title = "PC3 - WT / SPINK")
print(vp1.PC)
Too late did I realize that in a secondary notebook I had made some very obvious changes to the design of my results for my primary DESeq object, given that it is very important to include the contrast statement to properly handle the interaction term to derive biologically meaningful comparisons.
res = results(DESeq.ds, contrast=c("condition", "WT", "SPINK1"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
#names(res) <- first_cols[first_cols[,1] %in% rownames(res)]
res[1:5,-(3:4)]
barplot(assay(DESeq.ds)[ix,],las=2, main=first_cols[first_cols[,1] %in% rownames(DGE.results.shrink),7][ ix ] )
res <- results(DESeq.ds, list( c("condition_SPINK1_vs_WT","celltypePC3.conditionSPINK1") ))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
res[1:5,-(3:4)]
barplot(assay(DESeq.ds)[ix,],las=2, main=first_cols[first_cols[,1] %in% rownames(DGE.results.shrink),7][ ix ] )
res = results(DESeq.ds, contrast=c("celltype","DU145","PC3"))
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
res[1:5,-(3:4)]
barplot(assay(DESeq.ds)[ix,],las=2, main=first_cols[first_cols[,1] %in% rownames(DGE.results.shrink),7][ ix ] )
res = results(DESeq.ds, name="celltypePC3.conditionSPINK1")
ix = which.min(res$padj) # most significant
res <- res[order(res$padj),] # sort
res[1:5,-(3:4)]
barplot(assay(DESeq.ds)[ix,],las=2, main=first_cols[first_cols[,1] %in% rownames(DGE.results.shrink),7][ ix ] )
There are many, many areas for improvement as well. The data provided generated an insufficient number of duplicates in order to ensure that the read counts generated from the RNA-seq experiments underlying this analysis could be accurately assessed (ideally there should be 6 duplicates ) [Love et al., 2014; Stephens 2017]. Further, initial screening of the data could have avoided a lot of heartbreak, as there was reason to believe that the data was slightly contaminated with excess primers and adapters, which could have hampered results. Additional close attention might have also been paid to the descriptions and methods for analysis described on the GEO database as well, as buried in the files were some very sensible suggestions that I might have also implemented for my own analysis (I disocvered these buried in the sample files far too late to attempt learning/utilizing bowtie, and so on.) It should also be important to note that due to the unexpectedly heavy course and professional workload of the author of this report, that some of the concepts necessary for the smooth enactment of this analysis were often learned on the fly, while reviewing notes and asking many questions to both Luce and Mervin Fansler. Further analysis should be performed to determine the source or underlying mechanism explaining their limitless patience and offerings of their own free time. Over the course of constructing this project, the author began to develop a genuine enjoyment of the process of NGS analysis, end to end, and only wishes he had slightly more time to see things through and get it right. That all being said, these areas for improvement would only be worth investing time into on the assumption that SPINK1 might indeed be a suitable biomarker for the treatment and prognostic assessment of the TME. In this analysis, we have demonstrated the viability of this candidate by demonstrating its pronounced affect on the expression of PCa cells, and as such, all haste should be made to follow up this research taking both its findings and the harsh lessons learned forward towards further analysis.
| Dataset.Type | Section Generated |
|---|---|
| FeatureCountsS2.txt | featureCounts |
| DESeq.ds | DESeq2 |
| n_genes_up / n_genes_down | R |